Oftentimes, datasets can be large with hundreds of thousands or even millions of rows. Provided that working memory is big enough to hold the entire dataset, R can handle it. Manipulating these large datasets with the standard R libraries can be a hassle. We will use the package data.table, which usually requires less memory and is faster than the standard data.frame format.
Data cleaning
First, we have to make sure that the data is in the right format. We can use the function str() to check the format of our variables and summary() for a first plausibility check.
# Check format of variables:
str(disp_events)
Classes ‘data.table’ and ‘data.frame’: 366 obs. of 5 variables: $ PATIENT.ID: int 42 42 42 42 42 42 42 42 42 42 … $ DATE.DISP : chr “14.01.2015” “07.03.2015” “03.10.2014” “03.12.2014” … $ ATC.CODE : chr “A02BC05” “A02BC05” “A02BC05” “A02BC05” … $ DOSE : chr “40 MG” “20 MG” “40 MG” “40 MG” … $ QUANTITY : int 28 14 28 28 28 28 28 1 96 50 … - attr(, “.internal.selfref”)= - attr(, “index”)= atomic
..- attr(*, “__PATIENT.ID“)= int
We can see that the DATE.DISP column is in CHARACTER instead of DATE format, and the DOSE is in CHARACTER format, too, because the unit is appended to it.
We convert the DATE.DISPto the appropriate format and extract the numeric part and the characters from the DOSEvariable into separate variables DOSE.numand UNIT. In addition, we convert the ATC-Code ATC.CODE to a factor variable. Let’s also convert the DOSE.num variable to NUMERIC and the UNIT variable to FACTOR to make sure that we have the same unit for all rows.
# Convert DATE to the DATE format and split DOSE into two variables:
disp_events[,`:=` (DATE.DISP = as.Date(DATE.DISP, format = "%d.%m.%Y"), #convert Date to date format
ATC.CODE = as.factor(ATC.CODE) #convert ATC-Code to factor variable
)]
disp_events[,c("DOSE.num", "UNIT"):= tstrsplit(DOSE, " ")] #split Dose on whitespace
disp_events[,DOSE := NULL]
# add the code to convert DOSE.num to numeric and UNIT to factor:
####your code here####
disp_events[,`:=` (DOSE.num = as.numeric(DOSE.num), #convert DOSE.num to numeric
UNIT = as.factor(UNIT) #convert UNIT to factor variable
)]
# Check format of variables:
str(disp_events)
Classes ‘data.table’ and ‘data.frame’: 366 obs. of 6 variables: $ PATIENT.ID: int 42 42 42 42 42 42 42 42 42 42 … $ DATE.DISP : Date, format: “2015-01-14” “2015-03-07” … $ ATC.CODE : Factor w/ 31 levels “A02BC02”,“A02BC05”,..: 2 2 2 2 2 2 2 5 15 14 … $ QUANTITY : int 28 14 28 28 28 28 28 1 96 50 … $ DOSE.num : num 4e+01 2e+01 4e+01 4e+01 4e+01 4e+01 4e+01 1e+05 5e+02 8e+02 … $ UNIT : Factor w/ 4 levels “MCG”,“MG”,“MICROG”,..: 2 2 2 2 2 2 2 4 2 2 … - attr(, “.internal.selfref”)= - attr(, “index”)= atomic
..- attr(*, “__PATIENT.ID“)= int
# Check summary of variables:
summary(disp_events)
PATIENT.ID DATE.DISP ATC.CODE QUANTITY
Min. :42.00 Min. :2014-07-01 A09AA02: 50 Min. : 1.00
1st Qu.:45.00 1st Qu.:2015-01-07 A02BC05: 47 1st Qu.: 28.00
Median :45.00 Median :2015-05-12 A11HA03: 35 Median : 60.00
Mean :44.91 Mean :2015-06-19 A05AA02: 23 Mean : 99.83
3rd Qu.:46.00 3rd Qu.:2016-01-08 J02AC02: 23 3rd Qu.:120.00
Max. :47.00 Max. :2016-07-09 R03DX05: 19 Max. :672.00
(Other):169
DOSE.num UNIT
Min. : 2.0 MCG : 3
1st Qu.: 56.2 MG :236
Median : 200.0 MICROG: 43
Mean : 17556.4 UI : 84
3rd Qu.: 800.0
Max. :1000000.0
Now that the data is in the right format, we can see if there are any implausible or missing data from the summary. We can see that all the dates appear to be from the years 2014-2016, which corresponds with our intended follow-up window. There are 31 different medications (as seen from the 31 factor levels of the ATC.CODEvariable). There are 4 different units in the UNIT variable: MG, MCG, MICROG, and UI. MCG and MICROG both refer to the same unit, microgrammes, so there should be only one version of this unit. We could change the original data and replace all instances of MCG with MICROG, but one of the data cleaning principles is to never change the original data. Instead, we will modify the data transparently and reproducibly in our script. If there are a lot of modificiations, this could be in a separate file that contains only the modifications.
# Assign *MCG* and *MICROG* to the same factor level
levels(disp_events$UNIT) <- list(MICROG=c("MCG", "MICROG"), MG="MG", UI="UI")
To calculate CMAs, AdhereR requires a DURATION for each dispensing event, but we only have the quantity. We could assume that patients need to administer one unit per day and use the QUANTITY variable, but for the medication of interest, this is clearly not appropriate. Sometimes, standard doses, e.g. WHO’s ‘Defined Daily Dose’ or other assumptions may be appropriate in some instances, but might introduce bias in other situations. For this example, we have a second database where the prescribed dosage for each medication per patient is recorded:
- patient unique identifier (
PATIENT.ID),
- event date (
DATE.PRESC; date of prescription, in the “mm/dd/yyyy” format),
- ATC code (
ATC.CODE; alpha-numeric code to identify the medication)
- dosage (
DAILY.DOSE; prescribed dose of the medication per day), and
- unit (
UNIT; unit of the prescribed dose).
# Load example prescription data:
load("./AdhereR Tutorial/data/example_presc_events.RData")
str(presc_events)
Classes ‘data.table’ and ‘data.frame’: 47 obs. of 5 variables: $ PATIENT.ID: int 43 45 46 45 46 47 43 46 47 45 … $ DATE.PRESC: Date, format: “2014-01-01” “2014-01-01” … $ ATC.CODE : chr “J01CR02” “J02AA01” “J01DF01” “J01DC02” … $ DAILY.DOSE: num 1300 12 75 500 1000 1000 1000 1000 1000 1000000 … $ UNIT : Factor w/ 3 levels “MG”,“MICROG”,..: 1 1 1 1 1 1 3 3 3 3 … - attr(*, “.internal.selfref”)=
summary(presc_events)
PATIENT.ID DATE.PRESC ATC.CODE
Min. :42.00 Min. :2014-01-01 Length:47
1st Qu.:45.00 1st Qu.:2014-01-01 Class :character
Median :45.00 Median :2014-01-01 Mode :character
Mean :45.21 Mean :2014-01-01
3rd Qu.:46.00 3rd Qu.:2014-01-01
Max. :47.00 Max. :2014-01-01
DAILY.DOSE UNIT
Min. : 0.1 MG :27
1st Qu.: 100.0 MICROG: 8
Median : 500.0 UI :12
Mean : 65161.9
3rd Qu.: 1400.0
Max. :1000000.0
Conveniently, this dataset is already clean and all the data is in the right format. Moreover, there is only one prescription event per medication, which occured before the first dispensing event, so we don’t have to deal with prescription changes during our follow-up period.
We can now merge the two datasets and calculate the duration for each dispensing event. We merge by PATIENT.ID, ATC.CODE code and UNIT to make sure that events are matched correctly. This is why it was necessary to clean up the units: Otherwise, some events might not merge correctly due to mismatches between the units. By default, the merge function only includes rows where the ID-variable are present in both instances. This means that we only capture medications that were prescribed and at least once dispensed during the follow-up period. If we want to capture all events, we can specify all = TRUE in the function arguments:
# Merge dispensing and prescription data:
med_events <- merge(disp_events, presc_events, by = c("PATIENT.ID", "ATC.CODE", "UNIT"), all = TRUE, sort = FALSE)
# Calculate the supply duration
med_events[,DURATION := (DOSE.num*QUANTITY)/DAILY.DOSE]
summary(med_events)
PATIENT.ID ATC.CODE UNIT DATE.DISP
Min. :42.00 Length:370 MICROG: 48 Min. :2014-07-01
1st Qu.:45.00 Class :character MG :237 1st Qu.:2015-01-07
Median :45.00 Mode :character UI : 85 Median :2015-05-12
Mean :44.92 Mean :2015-06-19
3rd Qu.:46.00 3rd Qu.:2016-01-08
Max. :47.00 Max. :2016-07-09
NA’s :4
QUANTITY DOSE.num DATE.PRESC
Min. : 1.00 Min. : 2.0 Min. :2014-01-01
1st Qu.: 28.00 1st Qu.: 56.2 1st Qu.:2014-01-01
Median : 60.00 Median : 200.0 Median :2014-01-01
Mean : 99.83 Mean : 17556.4 Mean :2014-01-01
3rd Qu.:120.00 3rd Qu.: 800.0 3rd Qu.:2014-01-01
Max. :672.00 Max. :1000000.0 Max. :2014-01-01
NA’s :4 NA’s :4 NA’s :52
DAILY.DOSE DURATION
Min. : 0.1 Min. : 3.00
1st Qu.: 100.0 1st Qu.: 20.00
Median : 800.0 Median : 30.00
Mean : 72616.4 Mean : 35.51
3rd Qu.: 1785.0 3rd Qu.: 30.00
Max. :1000000.0 Max. :461.54
NA’s :52 NA’s :56
When assessing adherence in clinical practice over an extended period of time, unlike in our example, there will be many events affecting adherence estimation, such as prescription changes or hospitalizations. These data are increasingly available for research, but can be tricky to process correctly. In its newest version, AdhereR now offers a function to link dispensing, prescription, and hospitalization data to improve the accuracy of adherence estimation. For each dispensing event, it automatically selects the last prescibed dose to calculate supply duration, checks for prescription changes and hospitalizations during this period, and adjusts the duration accordingly. It requires the following input:
x : A data.frame with the dispensing data
y : A data.frame with the prescription data
z : optional, a data.frame with the hospitalization data
ID.var : A character vector of the ID column (identical in all data sources)
DATE.PRESC.var : A character vector of the prescription date column (in y)
DATE.DISP.var : A character vector of the prescription date column (in x)
DATE.format : A character vector of the date format (identical in all data sources)
CATEGORY.var : A character vector of the medication identification column (identical for x and y)
TOTAL.DOSE.var : A numeric vector of the column with the dispensed dose (in x)
DAILY.DOSE.var : A numeric vector of the column with the daily prescribed dose (in y)
PRESC.DURATION.var : : optional, A integer vector of the column with the prescription duration in days (in y)
UNIT.var : optional, A character vector of the medication unit column (identical for x and y)
FORM.var : optional, A character vector of the medication form column (identical for x and y)
VISIT.var : optional, A integer vector of the visit number (in y)
force.init.presc : logical, default TRUE; should first prescibed dose be used for dispensing events occuring before the first prescription event?
force.presc.renew : logical, default TRUE; if a medication has not been prescribed during a prescription event, should its prescription end on this date?
consider.dosage.change : logical, default TRUE; should the supply duration be recalculated in case of prescription changes?
Because hospitalization data is optional, we can use this function with our example datasets:
# Merge dispensing and prescription data with AdhereR's medication_match function:
##### Fill in the arguments for the function call #######
In addition to the standard AdhereR columns, the output of the matching function contains some more columns with additional information:
FIRST.PRESC : A Date column with the date when the treatment was first prescribed
PRESC.START : A Date column with the start date of a prescription episode
PRESC.END : A Date column with the end date of a prescription episode. If there is no end date, this will be NA
DOSAGE.CHANGE : An integer column with the number of dosage changes considered for a given dispensing event.
Before calculating adherence, we check the results of our matching function for plausibility.
# Check output of matching function
str(NULL)
NULL
summary(NULL)
Length Class Mode 0 NULL NULL ## Visualization of patient records
A first step towards deciding which algorithm is appropriate for these data is to explore medication histories visually. We do this by creating an object of type CMA0 for the two example patients, and plotting it. This type of plots can of course be created for a much bigger subsample of patients and saved as as a JPEG, PNG, TIFF, EPS or PDF file using R’s plotting system for data exploration.
# Create an object "cma0" of the most basic CMA type, "CMA0":
cma0 <- CMA0(data=med.events, # use the two selected patients
ID.colname="PATIENT_ID", # the name of the column containing the IDs
event.date.colname="DATE", # the name of the column containing the event date
event.duration.colname="DURATION", # the name of the column containing the duration
event.daily.dose.colname="PERDAY", # the name of the column containing the dosage
medication.class.colname="CATEGORY", # the name of the column containing the category
followup.window.start=0, # FUW start in days since earliest event
observation.window.start=182, # OW start in days since earliest event
observation.window.duration=365, # OW duration in days
date.format="%m/%d/%Y"); # date format (mm/dd/yyyy)
# Plot the object (CMA0 shows the actual event data only):
plot(cma0, # the object to plot
align.all.patients=TRUE); # align all patients for easier comparison
We can see that patient 76 had an interruption of more than 100 days between the second and third medB supply and several situations of new supply acquired while the previous supply was not exhausted. Patient 37 had shorter gaps between consecutive events, but very little overlap in supplies. For patient 76, the switch to medB happened while the medA supply was still available, then a switch back to medA happened later, at the end of the second year. For patient 37, there was a single medication switch (to medB) without an overlap at that point.
These observations highlight several decision points in calculating persistence and adherence, which need to be informed by the clinical context of the study:
- what OW is relevant for calculating adherence and persistence? Both patients have been on treatment during the 2 years, they had a substantial number of events of relatively short duration, and variable delays between the end of a supply and the next event. Thus, their adherence might have oscillated substantially during this period. We could compute adherence and/or persistence for the full 2-year period, or consider shorter intervals;
- is the largest interruption seen in patient
76 an indication of non-persistence, or of lower adherence over that time interval? If the medication is likely to be used rarely despite daily use recommendations, such an interval might indicate a period of low adherence. If usual adherence rates are close to 100% when used, that delay is likely to indicate a treatment gap and needs to be treated as such, and the last 2 events as reinitiation of treatment (new treatment episode);
- is the switch from
medA to medB an indicator of a new treatment episode? If medA and medB are two alternative formulations of the same chemical molecule, there might be clinical arguments for considering them as part of the same treatment episode (e.g. the pharmacist provided an alternative option to a product unavailable at the moment). If they are two distinct drug classes with different mechanisms of action and recommendations of use, it may be more appropriate to consider that patient 76 has had 3 treatment episodes and patient 37 one episode;
- is it necessary to consider carry-over of oversupply from previous events? For patient
37 this seems to matter very little, as there is little overlap between event durations, but patient 76 has substantial overlaps. If available medication is not likely to be either overused or discarded at every new medication event, it is important to control for carry-over;
- it is necessary to consider carry-over also when medication changes? Patient
76 has changed from medA to medB while still having a large supply of medA available. Was the patient more likely to discard the remaining medA the moment of receiving medB or finish it before starting the medB supply? If they are two alternative formulations and medB was (for example) given because medA was not in stock at the moment, probably this came with a recommendation to finish the available supply. If they are two distinct drug classes and the switch happens usually after assessment of therapeutic versus side effects, probably this came with a recommendation to stop using medA.
These decisions therefore need to be taken based on a good understanding of the pharmacological properties of the medication studied, and the most plausible clinical decision-making in routine care. This information can be collected from an advisory committee with relevant expertise (e.g. based on consensus protocols), or (even better) qualitative or survey research on the routine practices in prescribing, dispensing and using that specific medication. Of course, this is not always possible – a second-best option (or even complementary option, if consensus is not reached) is to compare systematically the effects of different analysis choices on the hypotheses tested (e.g. as sensitivity analyses).